Statistical technologies for NLP
Diana Trandabăț
Academic year 2018-2019

Introduction
•Statistical vs. symbolic
•Statistics: probabilities
–Joint probabilities
–Conditional probabilities
–Entropy
–Etc.
•Language models
•Machine Learning

Probabilities – very short reminder
•Probability of rolling a 2 on a die
•Probability of rolling a 7 on a die
•Probability of rolling either 1 or 2 or 3 or 4 or 5 or 6 on a die
•Probability of the next word in:
–What do you think the next …
•Probability of dog and bark occurring together in a sentence?
•Probability of translating dog as câine?

Statistical technologies
•Language models
•Collocations
•Text classification
•Information Retrieval
•Machine Translation

Language Models
•Speech recognition
–“Eye eight uh Jerry” or
–“I ate a cherry”?
•OCR & handwriting recognition
–More probable sentences are more likely correct readings
•Machine translation
–More likely sentences are probably better translations
•Generation
–More likely sentences are probably better generations
•Context-sensitive spelling correction
–“Their are problems wit this sentence”
–“Neam cumpărat un calculator care sa defectat dea doua zi: supt multe cuvinte se pune o linie roșă care nu pot sos cot” (Romanian, with deliberate real-word errors; intended meaning: “I bought a computer that broke down the next day: many words get a red underline that I cannot remove”)

Counting Words in Corpora
•What is a word?
–E.g., are cat and cats the same word?
–September and Sept?
–zero and oh?
–Are symbols such as * ? ) , words?
–How many words are there in don’t? In gonna?
–In Japanese and Chinese text, how do we identify a word?

What’s a word?
天主教教宗若望保祿二世因感冒再度住進醫院。這是他今年第二度因同樣的病因住院。
وقال مارك ريجيف الناطق باسم الخارجية الإسرائيلية إن شارون قبل الدعوة وسيقوم للمرة الأولى بزيارة تونس، التي كانت لفترة طويلة المقر الرسمي لمنظمة التحرير الفلسطينية بعد خروجها من لبنان عام 1982
Выступая в Мещанском суде Москвы экс-глава ЮКОСа заявил не совершал ничего противозаконного, в чем обвиняет его генпрокуратура России
भारत सरकार ने आर्थिक सर्वेक्षण में वित्तीय वर्ष 2005-06 में सात फीसदी विकास दर हासिल करने का आकलन किया है और कर सुधार पर ज़ोर दिया है
日米連合で台頭中国に対処…アーミテージ前副長官提言
조재영 기자= 서울시는 25일 이명박 시장이 `행정중심복합도시'' 건설안 에 대해 `군대라도 동원해 막고싶은 심정''이라고 말했다는 일부 언론의 보도를 부인했다

Word-based Language Models
•A model that enables one to compute the probability, or likelihood, of a sentence S, P(S)
•Simple: every word follows every other word with equal probability (0-gram)
–Assume |V| is the size of the vocabulary V
–Likelihood of a sentence S of length n is $1/|V| \times 1/|V| \times \dots \times 1/|V| = (1/|V|)^n$
–If English has 100,000 words, the probability of each next word is 1/100,000 = 0.00001

Word Prediction: Simple vs. Smart
•Smarter: probability of each next word is related to word frequency (unigram)
–Likelihood of sentence S: $P(S) = P(w_1) \times P(w_2) \times \dots \times P(w_n)$
–Assumes the probability of each word is independent of the other words
•Even smarter: look at the probability given the previous words (N-gram)
–Likelihood of sentence S: $P(S) = P(w_1) \times P(w_2 \mid w_1) \times \dots \times P(w_n \mid w_{n-1})$
–Assumes the probability of each word depends on the preceding words

N-Gram Models
•Estimate the probability of each word given prior context
–P(phone | Please turn off your cell)
•Number of parameters required grows exponentially with the number of words of prior context
•An N-gram model uses only N−1 words of prior context
–Unigram: P(phone)
–Bigram: P(phone | cell)
–Trigram: P(phone | your cell)
•The Markov assumption is the presumption that the future behavior of a dynamical system depends only on its recent history. In particular, in a kth-order Markov model, the next state only depends on
the k most recent states; an N-gram model is therefore an (N−1)th-order Markov model.

Maximum Likelihood Estimate (MLE)
•N-gram conditional probabilities can be estimated from raw text based on the relative frequency of word sequences
–Bigram: $P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$
–N-gram: $P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n-1} w_n)}{C(w_{n-N+1}^{n-1})}$
•To have a consistent probabilistic model, append a unique start symbol and a unique end symbol to every sentence and treat these as additional words

Evaluation of language models
•Perplexity and entropy: how do you estimate how well your language model fits a corpus once you’re done?
•Smoothing and backoff: how do you handle unseen n-grams?

Statistical technologies
•Language models
•Collocations
•Text classification
•Information Retrieval
•Machine Translation

What is a Collocation?
•A COLLOCATION is an expression of two or more words that corresponds to some conventional way of saying things
•The words together can mean more than the sum of their parts
•Examples of collocations
–noun phrases like strong tea, weapons of mass destruction, hot dog, mother in law, disk drive
–phrasal verbs like to make up, and other phrases like the rich and powerful
•Valid or invalid?
–a stiff breeze but not a stiff wind (while either a strong breeze or a strong wind is okay)
–broad daylight (but not bright daylight or narrow darkness)

Criteria for Collocations
•Typical criteria for collocations:
–non-compositionality
–non-substitutability
–non-modifiability
•Collocations usually cannot be translated into other languages word by word
•A phrase can be a collocation even if its words are not consecutive (as in the example knock … door)

Non-compositionality
•A phrase is compositional if the meaning can be predicted from the meaning of the parts
–E.g. new companies
•A phrase is non-compositional if the meaning cannot be predicted from the meaning of the parts
–E.g. hot dog
•Collocations are not necessarily fully compositional, in that there is usually an element of meaning added to the combination. E.g. strong tea
•Idioms are the most extreme examples of non-compositionality. E.g. to hear it through the grapevine

Non-substitutability / Non-modifiability
•We cannot substitute near-synonyms for the components of a collocation
•For example
–We can’t say yellow wine instead of white wine, even though yellow is as good a description of the color of white wine as white is (it is kind of a yellowish white)
•Many collocations cannot be freely modified with additional lexical material or through grammatical transformations (non-modifiability)
–E.g. white wine, but not whiter wine
–mother in law, but not mother in laws

Principal Approaches to Finding Collocations
•How to automatically identify collocations in text?
•Simplest method: selection of collocations by frequency
•Selection based on mean and variance of the distance between focal word and collocating word
•Hypothesis testing
•Mutual information

Frequency
•Find collocations by counting the number of occurrences
•Need also to define a maximum window size
•Usually results in a lot of function word pairs that need to be filtered out
•Fix: pass the candidate phrases through a part-of-speech filter which only lets through those patterns that are likely to be “phrases” (Justesen and Katz, 1995)

[Table: most frequent bigrams in a corpus] Except for New York, all the bigrams are pairs of function words.

[Table: the most highly ranked phrases after applying the filter to the same corpus]

Collocational Window
•Many collocations occur at variable distances. A collocational window needs to be defined to locate these. The frequency-based approach can’t be used
–she knocked on his door
–they knocked at the door
–100 women knocked on Donaldson’s door
–a man knocked on the metal front door

Mean and Variance
•The mean $\mu$ is the average offset between two words in the corpus
•The variance is $s^2 = \frac{\sum_{i=1}^{n} (d_i - \mu)^2}{n - 1}$, where n is the number of times the two words co-occur, $d_i$ is the offset for co-occurrence i, and $\mu$ is the mean
•Mean and variance characterize the distribution of distances between two words in a corpus
–High variance means that co-occurrence is mostly by chance
–Low variance means that the two words usually occur at about the same distance

Finding collocations based on mean and variance

Ruling out Chance
•Two words can co-occur by chance
•High frequency and low variance can be accidental
•Hypothesis testing measures the confidence that co-occurrence was really due to association, and not just due to chance
•Formulate a null hypothesis H0 that there is no association between the words beyond chance occurrences
•Compute the probability p that the event would occur if H0 were true, and then reject H0 if p is too low (typically if beneath a significance
level of p < […]

[…] a weight $w_{ij} > 0$ is associated with each term $t_i$ of a document $d_j \in D$:
$d_j = (w_{1j}, w_{2j}, \dots, w_{|V|j})$
•For a term that does not appear in document $d_j$, $w_{ij} = 0$

Boolean model (contd.)
•Query terms are combined logically using the Boolean operators AND, OR, and NOT
–E.g., ((data AND mining) AND (NOT text))
•Retrieval
–Given a Boolean query, the system retrieves every document that makes the query logically true
–Called exact match
•The retrieval results are usually quite poor because term frequency is not considered

Boolean queries: Exact match
•The Boolean retrieval model lets the user pose any query that is a Boolean expression:
–Boolean queries are queries using AND, OR and NOT to join query terms
•Views each document as a set of words
•Is precise: a document either matches the condition or it does not
–Perhaps the simplest model to build an IR system on
•Primary commercial retrieval tool for 3 decades
•Many search systems you still use are Boolean:
–Email, library catalog, Mac OS X Spotlight

Strengths and Weaknesses
•Strengths
–Precise, if you know the right strategies
–Precise, if you have an idea of what you’re looking for
–Implementations are fast and efficient
•Weaknesses
–Users must learn Boolean logic
–Boolean logic cannot capture the richness of language
–No control over the size of the result set: either too many hits or none
–When do you stop reading? All documents in the result set are considered “equally good”
–What about partial matches?
Documents that “don’t quite match” the query may also be useful

Vector Space Model
[Figure: documents d1–d5 drawn as vectors in a space with term axes t1, t2, t3; θ and φ mark angles between vectors]
Assumption: documents that are “close together” in vector space “talk about” the same things
Therefore, retrieve documents based on how close the document is to the query (i.e., similarity ~ “closeness”)

Similarity Metric
•Use the “angle” between the vectors:
$\cos(\theta) = \frac{\vec{d_j} \cdot \vec{d_k}}{|\vec{d_j}|\,|\vec{d_k}|}$
$sim(d_j, d_k) = \frac{\sum_{i=1}^{n} w_{i,j} w_{i,k}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2}\,\sqrt{\sum_{i=1}^{n} w_{i,k}^2}}$
•Or, more generally, inner products:
$sim(d_j, d_k) = \vec{d_j} \cdot \vec{d_k} = \sum_{i=1}^{n} w_{i,j} w_{i,k}$

Other similarity metrics
•Dice similarity:
$Sim(d_j, d_k) = \frac{2 \sum_i w_{i,j} w_{i,k}}{\sum_i w_{i,j}^2 + \sum_i w_{i,k}^2}$
•Jaccard similarity:
$Sim(d_j, d_k) = \frac{\sum_i w_{i,j} w_{i,k}}{\sum_i w_{i,j}^2 + \sum_i w_{i,k}^2 - \sum_i w_{i,j} w_{i,k}}$

Vector space model
•Documents are treated as a “bag” of words or terms
•Each document is represented as a vector
•However, the term weights are no longer 0 or 1. Each term weight is computed based on some variation of the TF or TF-IDF scheme

TF-IDF
•TF-IDF: term frequency-inverse document frequency
–weight(t,d) = tf(t,d) * idf(t,D)
–a numerical statistic that reflects how important a word is to a document in a corpus
–used as a weighting factor in information retrieval and text mining
•Variations of the TF-IDF weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query
•TF-IDF can be successfully used for stop-word filtering in various subject fields, including text summarization and classification

Retrieval in vector space model
•Query q is represented in the same way or slightly differently
•Relevance of $d_i$ to q: compare the similarity of query q and document $d_i$
•Cosine similarity (the cosine of the angle between the two vectors)
•Cosine is also commonly used in text clustering

Stopwords removal
•Many of the most frequently used words in a language are useless in IR and text mining – these words are called stop words
–the, of, and, to, …
–Typically about 400 to 500
such words
–For an application, an additional domain-specific stopword list may be constructed
•Why do we need to remove stopwords?
–Reduce indexing (or data) file size
•stopwords account for 20-30% of total word counts
–Improve efficiency and effectiveness
•stopwords are not useful for searching or text mining
•they may also confuse the retrieval system

Stemming
•Techniques used to find the root/stem of a word. E.g.,
–user, users, used, using → stem: use
–engineering, engineered, engineer → stem: engineer
•Usefulness:
–improving effectiveness of IR and text mining
•matching similar words
•mainly improves recall
–reducing indexing size
•combining words with the same roots may reduce indexing size by as much as 40-50%

Statistical technologies
•Language models
•Collocations
•Text classification
•Information Retrieval
•Machine Translation

Noisy Channel Model
[Figure: noisy channel pipeline. Sent message in source language (f) → translation model P(f|e), obtained by statistical analysis of bilingual text → broken message in target language (e) → language model P(e), obtained by statistical analysis of monolingual text → received message in target language. Example: “Ce foame am” → candidates “What hunger have I”, “Hungry am I so”, “I am so hungry”, “Have I that hunger” → “I am so hungry”]

And much more…
•Syntactic parsing (structure of the sentence)
•Semantic labeling (who does what, when, how, why?)
•Word sense disambiguation
•Text to speech / speech to text
•Sentiment analysis
•…
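
Worked sketch: TF-IDF retrieval
The vector space retrieval from the Information Retrieval part can be sketched in a few lines: documents and the query become TF-IDF vectors and are ranked by cosine similarity. This is only an illustration under stated assumptions. The toy documents and all function names are invented here, tf is taken as the raw count, and idf(t, D) = log(N / df(t)) is one common variant; the slides themselves only state weight(t,d) = tf(t,d) * idf(t,D).

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented purely for illustration.
DOCS = {
    "d1": "data mining finds patterns in data",
    "d2": "text mining applies data mining to text",
    "d3": "the cat sat on the mat",
}


def tokenize(text):
    """Lowercase whitespace tokenizer (no stemming, no stopword removal)."""
    return text.lower().split()


def idf_table(docs):
    """idf(t, D) = log(N / df(t)); an assumed variant, not taken from the slides."""
    n_docs = len(docs)
    doc_freq = Counter()
    for text in docs.values():
        doc_freq.update(set(tokenize(text)))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """weight(t, d) = tf(t, d) * idf(t, D), with tf as the raw term count."""
    counts = Counter(tokenize(text))
    # Terms never seen in the corpus have no idf and are ignored.
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (doc_id, score) pairs sorted by decreasing cosine similarity."""
    idf = idf_table(docs)
    q_vec = tfidf_vector(query, idf)
    scores = {
        doc_id: cosine(q_vec, tfidf_vector(text, idf))
        for doc_id, text in docs.items()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for doc_id, score in rank("data mining", DOCS):
        print(doc_id, round(score, 3))
```

Unlike the Boolean model, every document gets a graded score, so partial matches are ranked instead of being dropped. A term that occurs in every document gets idf = 0 under this variant, which is how TF-IDF can act as a soft stopword filter.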